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ABSTRACT 

We present Broccoli, a fast and easy-to-use search engine for 
what we call semantic full-text search. Semantic full-text 
search combines the capabilities of standard full-text search 
and ontology search. The search operates on four kinds of 
objects: ordinary words (e.g., edible), classes (e.g., plants), 
instances (e.g., Broccoli), and relations (e.g., occurs-with or 
native-to). Queries are trees, where nodes are arbitrary bags 
of these objects, and arcs are relations. The user interface 
guides the user in incrementally constructing such trees by 
instant (search-as-you-type) suggestions of words, classes, 
instances, or relations that lead to good hits. Both standard 
full-text search and pure ontology search are included as 
special cases. 

In this paper, we describe the query language of Broccoli, 
a new kind of index that enables fast processing of queries 
from that language as well as fast query suggestion, the nat- 
ural language processing required, and the user interface. 
We evaluated query times and result quality on the full ver- 
sion of the English Wikipedia (32 GB XML dump) combined 
with the YAGO ontology (26 million facts). We have imple- 
mented a fully functional prototype based on our ideas, see 
|http : //broccoli . i nf ormat ik . uni-f re iburg . de 

Categories and Subject Descriptors 

H.3.1 [Information Storage and Retrieval]: Context 
Analysis and Indexing — Indexing methods, Linguistic pro- 
cessing; H.3.3 [Information Storage and Retrieval]: In- 
formation Search and Retrieval — Query formulation, Retrieval 
models, Search process; H.5.2 [Information Interfaces and 
Presentation]: User Interfaces — Theory and methods 

General Terms 

Algorithms, Design, Experimentation, Human Factors, Per- 
formance 

Keywords 

Semantic full-text search, search as you type, Wikipedia 



1. INTRODUCTION 

In this paper, we describe a novel implementation of what 
we call semantic full-text search. Semantic full-text search 
combines traditional full-text search with structured search 
in knowledge databases or ontology search as we call it in 
this paper. 

In traditional full-text search you type a (typically short) 
list of keywords and you get a list of documents containing 
some or all of these keywords, hopefully ranked by some no- 
tion of relevance to your query. For example, typing broccoli 
leaves edible in a web search engine will return lots of web 
pages with evidence that broccoli leaves are indeed edible. 

In ontology search, you are given a knowledge database 
which you can think of as a store of subject-predicate-object 
triples. For example, Broccoli is-a plant or Broccoli native- 
to Europe. These triples can be thought of to form a graph 
of entities (the nodes) and relations (the edges) , and ontol- 
ogy search allows you to search for subgraphs matching a 
given pattern. For example, find all plants that are native 
to Europe. 

Many queries of a more "semantic" nature require the 
combination of both approaches. For example, consider the 
query plants with edible leaves and native to Europe, which 
will be our running example in this paper. A satisfactory an- 
swer for this query requires the combination of two kinds of 
information. First, a list of plants native to Europe. This is 
hard for full-text search but a showcase for ontology search, 
see above. Second, for each plant the information whether 
its leaves are edible or not. This kind of information can be 
easily found with a full-text search for each plant, see above. 
But it is quite unlikely (and unreasonable) to be contained 
in an ontology, for reasons explained in Section [2. 3| 

The basic principle of our combined search is to find con- 
textual co-occurrences of the words from the full-text part 
of the query with entities matching the ontology part of the 
query. Consider the sentence: The stalks of rhubarb are ed- 
ible, but its leaves are toxic. Assume for now that we can 
recognize entities from the ontology in the full text (we come 
back to this in Section [3.2[ |. In this case, the two underlined 
words both refer to rhubarb, which our ontology knows is 
a plant that is native to Europe. Obviously, this sentence 
should not count as evidence that rhubarb leaves are edible. 
We handle this by decomposing each sentence into what we 
call its contexts: the parts of the sentence that "belong" 
together. In this case the stalks of rhubarb are edible and 
rhubarb leaves are toxic. An arc from the query tree now 
matches if and only if its elements co-occur in one and the 
same context; this will be explained in detail in Section |5.3| 
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Broccoli 

Ontology: Broccoli 
Broccoli: is a plant; native to Europe. 
Document: Edible plant stems 
The edible portions of Broccoli are 

well as the leaves. 




Cabbage 

Ontology: Cabbage 

Cabbage: is a plant; native to Europe. 

Document: Cabbage 

The only part of the plant that is normally eaten is the leafy head. 



Figure 1: A screenshot of the final result for our example query. The box on the top right visualizes the 
current query as a tree. There is always one node in focus (shown in bold), in this case, the root of the tree. 
The large box below shows the hits grouped by instance (of the class from the root node) and ranked by 
relevance (if Broccoli is among the hits, we always rank it first). Evidence both from the ontology and the 
full text is provided. For the latter, a whole sentence is shown, with parts outside of the matching context 
grayed out. With the search field on the top left, the query can be extended further. The four boxes below 
provide context-sensitive suggestions that depend on the current focus in the query, here: suggestions for 
subclasses of plants, suggestions for instances of plants that lead to a hit, suggestions for relations to further 
refine the query. One of the suggestions is always highlighted, in this case the cultivated-in relation. It can 
be directly added to extend the query by pressing Return. 



Figures [T] and [2] show screenshots of our search engine in 
action for our example query. The figure and its caption also 
explain how the query can be constructed incrementally in 
an easy way and without requiring knowledge of a partic- 
ular query language on the part of the user. We encour- 
age the reader to try our online demo on http : //broccoli . 
inf ormat ik . uni- f reiburg . de 

1.1 Our contribution 

Broccoli supports a subset of SPARQLQ (essentially trees 
with a single free variable at the root) for the ontology part 
of queries. Moreover, it allows a special occurs-with relation 
that can be used to specify co-occurrence of a class (e.g., 
plant) or instance (e.g., Broccoli) with an arbitrary com- 
bination of words, instances, and further subqueries. Both 
traditional full-text search and pure ontology search are sub- 
sumed as special cases. This gives a very powerful query 
language. See Section [1] for details. 

For the occurs-with relation, we provide a novel kind of 
pre-processing that decomposes sentences into contexts of 
words that belong together. In particular, this considers 
enumerations and sub-clauses. Previous approaches have 
used co-occurrence in a whole paragraph or sentence, or 
based on word proximity; all of these often give poor re- 



1 http : //www . w3 . org/TR/rdf - sparql-query 



suits. See Section [3] for details. 

We present a novel kind of index that supports fully in- 
teractive query times of around 100 milliseconds and less 
for a collection as large as the full English Wikipedia (32 
GB XML dump, 290 million contexts of the kind just de- 
scribed). Previous approaches, including adaptations of the 
classic inverted index, yield query times on the order of sec- 
onds or even minutes for the kind of queries we support on 
collections of this size. See Section [2. 1| for related work, and 
Section [5] for details. 

All the described features have been implemented into a 
fully functional system with a comfortable user interface. 
There is a single search field, as in full-text search, and sug- 
gestions are made after each keystroke. This allows the user 
to incrementally construct semantic full-text queries without 
prior knowledge of a query language. Results are ranked by 
relevance and grouped by instance, and displayed together 
with context snippets that provide full evidence for why that 
particular instance is shown. See Figures [1] and [2] for an ex- 
ample, and Section || for details. 

We provide experimental results on index construction 
times, query times, and result quality for the English Wiki- 
pedia combined with the YAGO ontology uM- For the qual- 
ity results, we use 15 queries from the TREC 2009 Entity 
Track benchmarks (e.g., Airlines that currently use Boeing 
747 planes). We achieve very good results, and we even 
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Figure 2: Snapshots of the query, search field, and suggestion boxes for three stations in the construction 
of our example query. Column 1: At the beginning of the query, after having typed plan. Column 2: After 
the class plant has been selected and the occurs-with relation has been added and having typed edible lea. 
Column 3: After having selected edible leaves. The focus automatically goes back to the root node. 



find many additional instances that were missing from the 
ground truth. See Section [7] for the details of our experi- 
ments. 

We want to remark that the natural language processing, 
the index, and the user interface behind Broccoli are com- 
plex problems each on their own. The contribution of this 
paper is the overall design of the system, the basic ideas 
for each of the mentioned components, an implementation 
of a fully functional prototype based on these ideas, and a 
first performance and quality evaluation providing a proof 
of concept. Optimization of the various components is the 
next step in this line of research; see Section [8] 

2. RELATED WORK 

Putting the work presented in this paper into context is 
hard for two reasons. First, the literature on semantic search 
technologies is vast. Second, "semantic" means so many dif- 
ferent things to different researchers. We roughly divide 
work in this broad area into four categories, and discuss 
each category separately in the following four subsections. 

2.1 Combined ontology and full-text search 

Ester [5] was the first system to offer efficient combined 
full-text and ontology search on a collection as large as the 
English Wikipedia. Broccoli improves upon Ester in three 
important aspects. First, Ester works with inverted lists 
for classes and achieves fast query times only on relatively 
simple queries; this is explained in more detail in Section 
|5.1| Second, Ester does not consider contexts but merely 
syntactic proximity of words / entities. Third, Ester's sim- 
plistic user interface was ok for queries with one relation, 



but practically unusable for more complex queries. 

Various other systems offering combinations of full-text 
and ontology search have been proposed. Semplore 20 sup- 
ports a query language similar to ours. However, elements 
from the ontology are not recognized in their contexts, but 
there is simply one piece of text associated with each in- 
stance (which would correspond to a single large context 
in our setting). Queries are processed with a standard in- 
verted index (see Section [5A]), and no particular UI is of- 
fered. In Hybrid Search [71, the full text and the ontol- 
ogy are searched separately with standard methods (Lucene 
and Sesame), and then the results are combined. There is 
no particular natural language processing. Concept Search 
[14| adds information about identified noun phrases and hy- 
ponyms to the index. Queries are bags of words, which are 
interpreted semantically. The query processing uses stan- 
dard methods (Lucene), with very long inverted lists for 
the semantic index items. GoNTogle 13 combines full text 



with annotations which are searched separately and then 
combined, similarly as in |7. Queries are bags of words. 
There is no full ontology search and no particular natu- 
ral language processing. 
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of- 



Faceted Wikipedia Search 
fers a user interface with similarities to ours. However, the 
query language is restricted, there is nothing comparable 
to our contexts but only a small abstract per entity like in 
[20| , and query processing is DB-based and very slow, de- 
spite the relatively small amount of data. SIREN (http: 
//siren, sindi ce . com[ l provides an integration of pure on- 
tology search into Lucene. How to combine the then possi- 
ble full-text and ontology searches is up to the user of the 
framework. Finally, systems like [2l] try to interpret a given 



keyword query semantically and translate it to a suitable 
SPARQL query for pure ontology search. 

2.2 Systems for entity retrieval 

Entity retrieval is a line of research which focuses on search 
requests and corresponding result lists centered around enti- 
ties (instead of around documents, as in traditional search). 
Since 2009, there is also a corresponding Entity Track at 
TRECQ The tasks of this track are both simpler and harder 
than what we aim at in this paper. 

They are harder because the overall goal is entity retrieval 
from web pages. The ClueWeb09 collection introduced at 
TREC 2009 is 25 TB of text. The relative information con- 
tent is low, however, as is typical for web contents. More- 
over, identifying a representative web page for an entity is 
part of the problem. 

To make the tasks feasible at all under these circum- 
stances, the queries are relatively simple. For example, Air- 
lines that currently use Boeing 747 planes^ Even then the 
tasks remain very hard, and, for example, NDCG&R figures 
average only around 30% even for the best systems [5]. 

Broccoli queries can be trees of arbitrary degree and depth. 
All entities that have a Wikipedia page are supported. And, 
most importantly, the query process is interactive, providing 
the user with instant feedback of what is in the collection 
and why a particular result appears. This is key for con- 
structing queries that give results of high quality. 

The price we pay is a more extensive pre-processing as- 
suming a certain "cleanliness" of the input collection. Our 
natural language processing currently requires around 1600 
core hours on the 32 GB XML dump of the English Wikipedia, 
see Section [73] And Wikipedia's rule of linking the first oc- 
currence of an important entity in an article to the respective 
Wikipedia article helps us for a high-quality entity recogni- 
tion; see Section |3.2| Bringing Broccoli's functionality to 
web search is a very reasonable next step, but out of scope 
for this article. 

Another popular form of entity retrieval is known as ad- 
hoc object retrieval [l7]. Here, the search is on structured 
data, as discussed in the next subsection. Queries are given 
by a sequence of keywords, similar as in full-text search, for 
example, doctors in barcelona. Then query interpretation 
becomes a non-trivial problem; see Section 2.4. 

2.3 Information extraction and ontology search 

Systems for ontology search have reached a high level of 
sophistication. For example, RDF-3X can answer complex 
SPARQL queries on the Barton dataset (50 million triples) 
in less than a second on average [16] . 

As part of the Semantic Web / Linked Open Data [8] 
effort, more and more data is explicitly available as fact 
triples. The bulk of useful triple data is still harvested from 
text documents though. The information extraction tech- 
niques employed range from simple parsing of structured 
information (for example, many of the relations in YAGO 
or DBpedia [2] come from the Wikipedia info boxes) over 
pattern matching (e.g., [I]) to complex techniques involv- 
ing non-trivial natural language processing like in our paper 



(e.g., [I]). For a relatively recent survey, see 18 



^http : //ilps . science .uva.nl/trec-entity 

3 In our framework these are queries with two nodes and one 
occurs-with edge. 



Our work differs from this line of research in two impor- 
tant aspects: (1) the full text remains part of the index that 
is searched at query time; and (2) our system is fully inter- 
active and keeps the human in the loop in the information 
extraction process. This has the following advantage: 

Ontologies are good for facts like which plants are native 
to which regions, who was born where on which date, etc. 
Such facts are easy to define and can be extracted from ex- 
isting data sources in large quantity and with reasonable 
quality. And once in the ontology, they are easily combin- 
able, permitting queries that would not work with full-text 
search. 

But for more complex facts like our broccoli has edible 
leaves, it is the other way round. They are easy to express 
and search in full text, but tedious to define, include, and 
maintain in an ontology. Let alone the problem of guessing 
the right relation names when searching for them. 

By keeping the full text, we can leverage the intelligence 
of the user at query time. The query Plant occurs-with edi- 
ble leaves does not specify the type of the relation between 
the occurrence of the plant and the occurrence of the words 
edible and leaves. Yet a moment's thought reveals that it is 
quite likely that a context matching these elements gives us 
what we want. Similarly as in full-text search, there is often 
no need to be overly precise in order to get what you want. 
And just like the result snippets in full-text search, Broc- 
coli's result snippets provide instant feedback on whether 
the listed plant is really one with edible leaves. 

Finally, if information extraction is desired nevertheless, 
Broccoli can be a useful tool for interactively exploring the 
collection with respect to the desired information, and for 
formulating appropriate queries. 

2.4 Systems for question answering 

Question answering (QA) systems provide similar func- 
tionality as our semantic full-text search. The crucial dif- 
ference is that questions can be asked in natural language, 
which makes the answering part much harder. The system 
is burdened with the additional and very complex task of 
"translating", in one way or the other, the given natural lan- 
guage query into a more formal query or queries that can be 
fed to a search engine and / or a knowledge database. 

The perfect QA system would obviate the need for a sys- 
tem like ours here. But research is still far from achieving 
that goal. All state-of-the-art QA systems, including the big 
commercial ones, are specialized to quite particular kinds 
of question. For example, Wolfram Alpha works perfectly 
for Which cities in China have more than 10 million in- 
habitants, but does not work if more is replaced by less or 
China by Asia, and does not even understand the question 
Which plants have edible leaves. IBM's Watson was tuned 
for finding the single most probable entity when given one of 
the (intentionally obscured) clues from the Jeopardy! game. 
And both of these systems lack transparency: it is hard to 
predict whether a question will be understood correctly, it 
is hard to understand the reasons for a missing or wrong 
answer, and there is no possibility of interaction or query 
refinement. 

For our semantic full-text search both the query language 
and the relation between a given query and its result are 
well-defined and maximally transparent to the user; see the 
discussion in Section [2. 3| The price we pay is query formu- 
lation in a non-natural language. The success of full-text 



search has shown that as long as the language is simple 
enough, it can work. 

3. INPUT DATA AND NATURAL LANGUAGE 
PRE-PROCESSING 

3.1 Input data 

Broccoli requires two kinds of inputs, a text collection and 
an ontology. The text collection consists of documents con- 
taining plain text. The ontology consists of typed relations 
with each relation containing an arbitrary set of fact triples. 
The subjects and objects of the triples are called instances. 
Each instance belongs to one or more classes. The classes 
are organized in a taxonomy; the root class is called Entity. 

3.2 Entity recognition 

The first step is to identify mentions of or referrals to 
instances from the ontology in the text documents. Consider 
the following sentence, which will be our running example 
for this section: 

(S) The usable parts of rhubarb, a plant from the Polygo- 
naceae family, are the medicinally used roots and the edible 
stalks, however its leaves are toxic. 

Both rhubarb and its refer to the instance Rhubarb from 
our ontology, which in turn belongs to the classes Plant and 
Vegetable (among others). 

Our entity recognition on the English Wikipedia is sim- 
plistic but reasonably effective. As a rule, first occurrences of 
entities in Wikipedia documents are linked to their Wikipedia 
page. When parsing a document, whenever a part or the full 
name of that entity is mentioned again in the same section 
of the document (for example, Einstein referring to Albert 
Einstein), we recognize it as that entity. 

We resolve anaphora in an equally simplistic way. Namely, 
we assign each occurrence of he, she, it, her, his, etc. to the 
last recognized entity of matching gender. We also recognize 
the pattern the < class> as the entity of the document if it 
belongs to <class>, for example, the plant in the document 
of Broccoli. 

Our results in Section |7.6| suggest that, on Wikipedia, 
these simple procedures give already a reasonable accuracy. 

3.3 Natural language processing 

The second step is to decompose document texts into what 
we call contexts, that is, sets of words that "belong" together. 
The contexts for our example sentence (S) from above are: 

(CI) rhubarb , a plant from the Polygonaceae family 

(C2) The usable parts of rhubarb are the medicinally used 
roots 

(C3) The usable parts of rhubarb are the edible stalks 
(C4) however rhubarb leaves are toxic 

This will be crucial for the quality of our results, because we 
do not want to get rhubarb in our answer set when searching 
for plants with edible leaves. Note that we assume here that 
the entity recognition and anaphora resolution have already 
been done (underlined words). Also note that we do not care 
whether our contexts are grammatically correct and form a 
readable text. This distinguishes our approach from a line 
of research called text simplification [10] . 

In the following, we will only consider contexts that are 
part of a single sentence. Indeed, after anaphora resolution, 



it seems that most simple facts are expressed within one and 
the same sentence. Our evaluation in Section [7.61 confirms 
this assumption. 

Our context decomposition consists of two parts, each de- 
scribed in the following subsections. 

3.3.1 Sentence constituent identification (SCI) 

The task of SCI is to identify the basic "building blocks" 
of a given sentence. For our purposes various kinds of sub- 
clauses and enumeration items will be important, because 
they usually contain separate facts that have no direct re- 
lationship to the other parts of the sentence. For example, 
in our sentence (S) from above, the relative clause a plant 
from the Polygonaceae family refers to rhubarb but has noth- 
ing to do with the rest of the sentence. Similarly, the two 
enumeration items the medicinally used roots and the edible 
stalks have nothing to do with each other (except that they 
both refer to rhubarb); in particular, rhubarb roots are not 
edible and rhubarb stalks are not medicinally used. Finally 
the part however its leaves are toxic needs to be considered 
separate from the preceding part of the sentence. As will 
become clear in the following, we consider these as enumer- 
ation items on the top level of the sentence. 

Formally, SCI computes a tree with three kinds of nodes: 
enumeration (ENUM), sub-clause (SUB), and concatenation 
(CONC). The leaves contain parts of the sentence and a 
concatenation of the leaves from left to right yields the whole 
sentence again. See Figure [3] for the SCI tree of the above 
sentence. 
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Figure 3: The SCI tree for our example sentence af- 
ter anaphora resolution. The head of the sub-clause 
is printed in bold. 

We construct our SCI trees based on the output of a state- 
of-the-art constituent parser. We use SENNA [II], because 
of its good trade-off between parse time (around 35ms per 
sentence) and result quality (see Section 7.61. 



We transform the parse tree using a relatively small set 
of hand-crafted rules. Here is a selection of the most im- 
portant rules; the complete list consists of only 11 rules but 
is omitted here for the sake of brevity. In the following de- 
scription when we speak of an NP (noun phrase), VP (verb 
phrase), SBAR (subordinate clause), or PP (prepositional 
phrase) we refer to nodes in the parse tree with that tag. 

(SCI 1) Mark as ENUM each node, for which the children 
(excluding punctuation and conjunctions) are either all NP 
or all VP. 

(SCI 2) Mark as SUB each SBAR. If it starts with a word 
from a positive-list (e.g., which or who) define the first NP 



on the left as the head of this SUB; this will be used in (SCR 
0) below. 

(SCI 3) Mark as SUB each PP starting with a preposition 
from a positive-list (e.g., before or while), and all PPs at the 
beginning of a sentence. These SUBs have no head. 

(SCI 4) Mark as CONC all remaining nodes and contract 
away each CONC with only text nodes in its subtree (by 
merging the respective text). 



For our definition of the answer set for a given query, we 
refer to Section 5.3 which describes the (recursive) algo- 



As our quality evaluation in Section [7.6| shows, our rules 
work reasonably well. 

3.3.2 Sentence constituent recombination (SCR) 

In SCR we recombine the constituents identified by the 
SCI to form our contexts, which will be the units for our 
search. Recall that the intuition is to have contexts such that 
only those words which "belong" together are in the same 
context. SCR recursively computes the following contexts 
from a SCI tree or subtree: 

(SCR 0) Take out each subtree labeled SUB. If a head was 
defined for it in (SCI 2), add that head as the leftmost child 
(but leave it in the SCI tree, too). Then process each such 
subtree and the remaining part of the original SCI tree (each 
of which then only has ENUM and CONC nodes left) sepa- 
rately as follows: 

(SCR 1) For a leaf, there is exactly one context: the part of 
the sentence stored in that leaf. 

(SCR 2a) For an inner node, first recursively compute the 
set of contexts for each of its children. 

(SCR 2b) If the node is marked ENUM, the set of contexts 
for this node is computed as the union of the sets of contexts 
of the children. 

(SCR 2c) If the node is marked CONC, the set of contexts 
for this node is computed as the cross-product of the sets of 
contexts of the children. 

We remark that once we have the SCI tree, SCR is straight- 
forward, and that the time for both SCI + SCR is negligible 
compared to the time needed for the full-parse of the sen- 
tences. 

4. QUERY LANGUAGE 

Queries to Broccoli are rooted trees with arcs directed 
away from the root. The root is either a class or an instance. 
There are two types of arcs: ontology arcs and occurs-with 
arcs. Both have a class or instance as source node. 

Ontology arcs are labeled by a relation from the ontology. 
The two nodes must be classes or instances matching the 
source and target type of the relation. The class or instance 
at the target node may be the root of another arbitrary tree. 

For occurs-with arcs, the target node can be an arbitrary 
set of words, prefixes, instances or classes. The instances 
or classes may themselves be the root of another arbitrary 
query. Example queries are given in Figures [T] and [2] 

To give an example of a more complex query: in Figure [l] 
we could replace the instance node Europe by a class node 
Location and add to it an occurs-with arc with the word 
equator in its target node. The intention of this query would 
be to obtain plants with edible leaves native to regions at or 
near the equator. 



rithm for processing an arbitrary given query. 

5. INDEX AND QUERY PROCESSING 

It will be instructive to first consider how a standard in- 
verted index can be used to process our semantic full-text 
queries and why that approach is bound to be slow (Section 
5.1 1. We will then describe our new index (Section 5.2 1 and 
how it can be used for fast query processing (Section |5,3[ ). 

5.1 Index: straightforward approach 

We first describe the straightforward realization of seman- 
tic full-text search with an inverted index. For example, this 
is the approach taken in |20| and 
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both using Lucene. 
For each recognized entity (see Section |3.2[ ), we simply 
add an item to the index representing the meant entity. To 
enable fast searches for classes of objects, we also need to 
add, for each occurrence of an entity, an index item for each 
class that object is in. For example, for each occurrence 
of the entity Broccoli, we would add an index item for the 
classes Plant and Vegetable as well. 

For the English Wikipedia combined with the YAGO on- 
tology, this simple approach will blow up the index size by 
a factor of 2-3, which is tolerable. What becomes a per- 
formance problem is that the inverted lists for some of the 
newly introduced items become very long. For example, the 
inverted lists for the class Person (which contains one item 
per occurrence of a person anywhere in the text) has over 
50 million index items for the English Wikipedia. Note that 
queries for a person are indeed frequent, and that we indeed 
need one index item per occurrence (and not just one per 
document). Otherwise we could not specify co-occurrence 
in the same context, which is key for our approach. Some 
queries also involve the top-level class Entity, which contains 
over 165 million index items. 

Moreover, the kind of queries we support here require 
more complex operations than just intersection on these in- 
verted lists (like in ordinary full-text search). In [5], an addi- 
tional database-style join operation is implemented. But de- 
spite a highly optimized implementation, sub-second query 
times could there only be achieved for queries involving 
classes with relatively short inverted lists, like counties and 
scientists; see (5[ Section 7.1]. As we see in the following sec- 
tion, our new index avoids inverted lists for instances and 
classes altogether. 

5.2 Index: our new approach 

Our new index is based on two key ideas. The first idea 
is taken from [fjl: use inverted lists for prefixes instead of 
words. This enables fast query suggestions. 

The second idea, which is the main idea behind our new 
index, is to have what we call context lists instead of inverted 
lists. The context list for a prefix contains one index item 
per occurrence of a word starting with that prefix, just like 
the inverted list for that prefix would. But along with that 
it also contains one index item for each occurrence of an 
arbitrary entity in the same context as one of these words. 
For example, consider the context the usable parts of rhubarb 
are its edible stalks , with recognized entities underlined. And 
let us assume that we have an inverted list for each 4-letter 
prefix. Then the part of the context list for edib* pertaining 
to this context (which has id, say, 14) would be: 



C14 C14 C14 

... #edible #Rhubarb #Stalk ... 

edib*: 

... 1 1 1 

... 8 5 9 

The numbers in the first row are context ids. The # in the 
second row means that not the actual entities (with capital 
letters) or words are stored, but rather unique ids for them. 
The third row contains the score for each index item, which 
in our current implementation is 1 for most index items, 
more than 1 for occurrences of an entity in its own docu- 
ment, and for "abstract" entities like Legume. The fourth 
row contains the position of the word or entity in the re- 
spective context. This is used for proximity search and for 
highlighting the query elements in the result snippets. The 
context lists are sorted by context id, and, for equal context 
ids, by word/entity id, with entities coming after the words. 

Relations are stored in the straightforward way, with one 
index list per relation. For example: 

... #Okra #Broccoli # Rhubarb ... 
native-to: ... ^Africa ^Europe ^Europe 
... 1 1 1 

Again the # means that ids are stored, not the actual 
entity names. The third row are the scores, which are all 
1 in our current implementation. The list is sorted by the 
second row, that is, by the target entity ids of the relation. 
Since queries may use a relation in both directions, we also 
store the reverse for each relation separately (with rows 1 
and 2 switched, and then again sorted by the ids from the 
second row). Technically, this is just another relation, for 
example, native-to (reversed). 

5.3 Query processing with the new index 

Recall from Section [4] that our queries are trees, where 
each inner node is labeled by a class or instance. However, 
in our internal query representation, each inner node is a 
(distinct) free variable and the class/instance labels are re- 
alized via additional relational arcs of the kind <var> is- a 
<class> or <var> equals <entity>. The former is more 
convenient for the user, while the latter is more convenient 
for the description of our query processing algorithm in the 
following, and also for its implementation. 

The final result for a query is always a sorted list of entity 
ids with scores, and so are intermediate results for subtrees 
with a single variable as root. The intermediate result for 
target nodes of occurs-with arcs are context lists as described 
in Section [5. 2| We process queries recursively as follows: 

Ontology arcs 

(QP 1) For each ontology arc compute the following sorted 
list of entities, where R denotes the relation of the arc: 

(QP 1.1) Recursively compute for the target node t (which 
can be the root of an arbitrary query again) the result Et, 
which is a sorted list of entity ids with scores. 

(QP 1.2) Fetch the index list Ir for the relation R, which is 
sorted by target entity; see Section [5. 2 1 

(QP 1.3) Compute the list Er of all entities x such that 
(x,y) G R for some y £ E t , via a straightforward intersection 
of Ir with E t . Since Ir is sorted by target entity id, this 
intersection can be computed efficiently in linear time. 



Occurs-with arcs 

(QP 2) For each occurs-with arc compute the following sorted 
list of entities, where W = {tui, • • • ,Wk} denotes the set of 
words or prefixes in the target node, and Z = {zi, ■ . . , Zi} 
denotes the set of entities or classes in the target node. 

(QP 2.1) For each Zi (which can be the root of an arbitrary 
query again) recursively compute its result Ei, which is a 
sorted list of entities. 

(QP 2.2) For each Wj, compute a context list as follows. 
In our index, we have a context list for each fe-letter prefix, 
for some fixed k > 1. Let / be the context list for the length- 
k prefix p of Wj, or, if Wj has length < k, for Scan 
over I and for each context, write all index items matching 
Wj (whole-word or prefix match, depending on what was 
specified in the query) to Cj , and if at least one item matches 
append all entity index items from that context, too. 

(QP 2.3) Intersect the Ci, . . . ,Ck, computed in step 2.2, 
such that the result list C contains all index items (c, e) 
where c is a context id that occurs in each of the Ci, . . . , Ck, 
and e is an entity that occurs in context c. Since the d are 
sorted by context ids, this can be computed in time N ■ log k, 
where N is the total number of index items in the Ci, . . . , Ck- 
(QP 2.4) Compute a subset C from C by keeping only those 
index items from C with a context id such that the con- 
text contains at least one entity from each of the E\ , . . . , Ei 
computed in Step 2.1. This can be done in time linear in 
C + | i?i | + ■ ■ ■ + \Ei\, by temporarily storing each Ei in a 
hash map. 

(QP 2.5) Extract all entities from C", aggregate the scores of 
all postings with the same id using summation and produce 
a result list that is sorted by entity id. 

Combining the sub-results for each arc 

(QP 3) Let Ei, ... , E m be the entity lists computed in steps 
2 (one list per ontology arc) and 3 (one list per occurs-with 
arc). The result list for the whole query is then simply the 
intersection of the E\ , . . . , E m , where the scores of index 
items with the same entity id are again simply summed up. 

Excerpts 

Excerpts of the kind shown in Figure [l] are easily produced 
from the intermediate context lists computed in (QP 2.3). 
For each instance displayed to the user, we simply sum up 
the scores of the matching index items and show the ones 
with the highest sum. 

6. USER INTERFACE 

For a convincing proof of concept for our interactive se- 
mantic search, we have taken great care to implement a fully 
functional and intuitive user interface. In particular, there 
is no need for the user to formulate queries in a language 
like SPARQL. 

The introduction and screenshots (Figures [l] and [5J have 
already provided a foretaste of the capabilities of our user 
interface. Here is a list of further features. Due to the space 
constraint, we mention only the most important ones: 

(UI 1) Search as you type: new suggestions and results 
with every keystroke. Very importantly, Broccoli's sugges- 

4 Words of length < k get a context list on their own, and 
there are only single- word suggestions for prefixes of length 
< k. 



tions for words, classes, instances, and relations are context- 
sensitive. That is, the displayed suggestions actually lead 
to hits, and the more / higher-scored hits they lead to, the 
higher they are ranked. 

(UI 2) Pre-select of most likely suggestion: Broccoli knows 
four kinds of objects: words, classes, instances, and rela- 
tions. Depending on where you are in the query construc- 
tion, you get suggestions for several of them. A new user 
may be overwhelmed to understand the different semantics 
of the different boxes. For that reason, after every keystroke 
Broccoli highlights the most meaningful suggestion, which 
can be selected by simply pressing Return. 

(UI 3) Visual query representation: At any time, the current 
query is shown as a tree, with a color code for the various 
elements that is consistent with the suggestion boxes. 
(UI 4) Change of focus / root: A click on any node in the 
query tree will change the focus of the query suggestions 
to that node. A double-click on any class or instance node 
will make that node the root of the tree and re-group and 
re-rank the results accordingly. 

7. EXPERIMENTS 

7.1 Input data 

Our text collection is the text from all documents in the 
English Wikipedia, obtained via download.wikimedia.org 
Performance evaluations have been performed on a version 
downloaded in September 2011. The quality evaluation (Sec- 
tion |7.6[ ) has been performed on a more recent download 
from June 2012. Some dimensions of the September 2011 
collection: 32 GB XML dump, 1.1 billion word occurrences, 
165 million recognized entity occurrences (see Section [3.2[ ), 
and 153 million sentences which we decompose into 290 mil- 
lion contexts (see Section [3. 3| |. 

As ontology we use the latest version of YAGO from Oc- 
tober 2009. We manually fixed 15 obvious mistakes in the 
ontology (for example, the noble prize was a laureate and 
hence a person) , and added the relation Plant native-in Lo- 
cation for demonstration purposes. Altogether our variant 
of YAGO contains 2.6 million entities, 17,661 classes, 23 re- 
lations, and 26 million facts. 

7.2 Computing environment 

The code for the index building and query processing is 
written entirely in C++. The code for the query evaluation 
is written in Perl, Java, and C++. Our pre-processing com- 
ponents (mainly entity-recognition, full parse, SCI + SCR) 
are written in C++ or Java and embedded into a UIMA|j 
pipeline. The full parse was scaled out asynchronously on a 
cluster of 8 PCs, each equipped with an AMD FX-8150 8- 
core processor and 16 GB of main memory. All performance 
tests were run on a single core of a Dell PowerEdge server 
with 2 Intel Xeon 2.6 GHz processors, 96 GB of main mem- 
ory, and 6x900 GB SAS hard disks configured as Raid-5. 

7.3 Pre-processing times and space usage 

The parallelized full parse using SENNA takes around 19 
hours. Basic parsing of the Wikipedia with our simple en- 
tity recognition, anaphora resolution and SCI + SCR takes 
around 6 hours. Going from there, building the index lists 
takes another 3 hours. 



Our index is kept in three separate files. The file for the 
context lists has a size of 42 GB. The total number of post- 
ings is 2.8 times as much as in a standard full-text index. 
The file for the relation lists has a size of 1 GB. The file 
for the document excerpts has a size of 73 GB, which could 
easily be reduced to 51 GB by eliminating the redundant 
information the file currently contains. Compression, which 
we currently do not use anywhere, could reduce the size of 
these files further. 

7.4 Query times, hits 

A central aspect of Broccoli is its interactivity. Table 
[T] provides the average response times for eight types of 
queries: (Ql) full-text only, one word; (Q2) full-text only, 
two words; (Q3) ontology only, one arc; (Q4) class occurs- 
with one word; (Q5) class occurs-with two words; (Q6) on- 
tology arc + occurs-with word; (Q7) ontology arc + occurs- 
with word and class; (Q8) ontology arc + occurs-with word 
and (class occurs-with one word). 



Query type 


av-time 


read 


agg 


filter 


i+m+r 


text-only 1 


135ms 


43ms 


22ms 


29ms 


35ms 


text-only 2 


179ms 


106ms 


< 1ms 


53ms 


10ms 


onto-only 


2ms 


1ms 


< 1ms 


< 1ms 


< 1ms 


onto+text 1 


89ms 


28ms 


11ms 


44ms 


1ms 


onto+text 2 


112ms 


69ms 


14ms 


31ms 


5ms 


onto+text 3 


57ms 


19ms 


10ms 


25ms 


< 1ms 


onto+text 4 


119ms 


41ms 


3ms 


60ms 


1ms 


onto+text 5 


161ms 


63ms 


13ms 


75ms 


1ms 
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Table 1: Average query times for eight types of 
queries supported by Broccoli. 

We synthetically generated 1,000 queries for each type. 
For each query, we select each word or class as follows, start- 
ing from the root. Pick a random prefix of length 2, and 
consider Broccoli's top suggestions for the query built so far 
(top 30 for words, top 5 for classes). Note that each of these 
suggestions will give a non-empty result set. Pick a random 
one of those suggestions. If no suggestion exists, try a dif- 
ferent random prefix. If 10 such attempts fail, start again 
from the root for that query. 

The times reported in this table are for computing and 
showing the hits, that is, the contents of the large box on 
the right in Figure [l] 

We observe that the average query times are around 100ms 
and that the bulk of the query time is spent in reading lists 
from the index. The other three columns provide the times 
for score aggregation (in QP 2.5 and QP 3 from Section 5.3 1, 
filtering (in QP 2.2 and QP 2.4), and intersection, merging 
and ranking of result lists (in QP 2.3 and QP 2.5). 

7.5 Query times, suggestions 

Table [2] provides the times for the various suggestions to 
appear in response to a single query (recall Figure |2j. We 
differentiate here between four kinds of stations in the query 
formulation process: (SI) type something in the beginning; 
(S2) type something to add an arc, refine a class, or restrict 
to an instance; (S3) type something to add something to the 
target node of an ontology arc; (S4) type something to add 
something to the target node of an occurs-with arc. We take 
the queries from classes (Q3) and (Q4) above, respectively, 



and type the respective element from left to right, starting 
with a prefix length of 3 for entities and 4 for words. Note 
that suggestions for more complex queries (Q5-8) would be 
faster and not slower, because the results sets are smaller. 



Station 


prefix-len 3 


prefix-len 4 


prefix-len > 5 


beginning 


< 1ms 


< 1ms 


< 1ms 


add arc 


4ms 


4ms 


4ms 


onto node 


1ms 


< 1ms 


< 1ms 


occw node 




93ms 


73ms 




(r44%,f54%) 


(r58%,f39%) 



Table 2: Query suggestion times for four different 
stations in the query formulation process. 

The table shows that only the word suggestions cost sig- 
nificant time, and that that time is used mainly for reading 
index lists and /iltering, see above. 

7.6 Result quality 

For our quality evaluation, we considered the topics and 
relevance judgements from the TREC 2009 Entity Track [3]. 
We removed all relevance judgements for pages that are not 
in the English Wikipedia; this approach was taken before 
in [9] as well. This leaves us with 15 queries with a decent 
number of relevance judgements. 

Similarly as in [5j, we also evaluated our system on a ran- 
dom selection of Wikipedia's List of ... pages with similar 
results as for the TREC benchmark. For the sake of brevity, 
details are omitted for this second benchmark. 

We first evaluated the impact of our context decomposi- 
tion from Section 3.3 {contexts) on result quality, by com- 
paring it against two simple baselines: taking each sentence 
as one context {sentences) and taking each section as one 
context {sections). 





#FP 


#FN 


Precision 


Recall 


Fl 


sections 


6.890 


19 


4.7% 


81.6% 


8.2% 


sentences 


392 


38 


39.2% 


65.0% 


36.8% 


contexts 


297 


36 


44.9% 


66.5% 


45.6% 



Table 3: Sum and averages over all TREC queries 
for Broccoli on sections, sentences and contexts. 

Table [3] shows that compared to sentences, our contexts 
decrease the (large) number of false-positives significantly. 
Even the number of false-negatives decreases. This is be- 
cause our document parser pre-processes lists by appending 
each list item to the preceding sentence (before the SCI+SCR 
phase). These are the only types of contexts crossing sen- 
tence boundaries and a rare exception. Since using sentences 
also led to a lower recall in addition to the expected lower 
precision, we also evaluated on sections. We can observe 
a decrease in the number of false-negatives (a lot of them 
due to random co-occurrence of query words) which does 
not outweigh the drastic increase of the number of false- 
positives. Overall, context decomposition results in a signif- 
icantly increased precision and F-Measurqj which confirms 



the positive impact on the user experience that we have ob- 
served. 

We manually investigated the reasons for the false-positives 
and false-negatives when using contexts. We defined the fol- 
lowing error categories. For false-positives: (FP1) a true hit 
which was missing from the ground truth; (FP2) the words 
in the context have a different meaning than what was in- 
tended by the query; (FP3) due to an error in the ontology; 
(FP4) a mistake in the entity recognition; (FP5) a mistake 
by the parser. (FP6) a mistake in our context decomposition. 
For false-negatives: (FN1) there seems to be no evidence for 
this entity in the Wikipedia based on the query we used. It 
is possible that the fact is present but expressed differently, 
e.g., by the use of synonyms of our query words; (FN2) 
the query elements are spread over two or more sentences; 
(FN3) a mistake in the ontology; (FN4) a mistake in the 
entity recognition; (FN5) a mistake by the parser; (FN6) a 
mistake in our context decomposition. 



#FP 


FP1 


FP2 


FP3 


FP4 


FP5 


FP6 


297 


55% 


11% 


5% 


12% 


16% 


1% 




#FN 


FN1 


FN2 


FN3 


FN4 


FN5 


FN6 


36 


22% 


6% 


26% 


21% 


16% 


8% 



F-Measure increase is highly statistically significan t (p - 
value of 0.002 for two-sided Fisher's randomization test |12|). 



Table 4: Breakdown of errors by category. 

Table [4] provides the percentage of errors in each of the- 
ses categories. The high number in FP1 is great news for 
us: many entities are missing from the ground truth but 
were found by Broccoli. Errors in FN1 occur when full- 
text search with our queries on whole Wikipedia documents 
does not yield hits, independent from our contexts. Tun- 
ing queries or adding support for synonyms can decrease 
this number. FP2 and FN2 comprise the most severe er- 
rors. They contain false-positives that still match all query 
parts in the same context but have a different meaning and 
false-negatives that are lost because contexts are confined to 
sentence boundaries. Fortunately, both numbers are quite 
small. 

The errors in categories FP/FN 3-5 depend on implemen- 
tation details and third-party components. Most of them 
(especially 3+4) can be corrected comparatively easy. Parse 
errors are harder. Assuming a perfect constituent parse for 
every single sentence, especially those with flawed grammar, 
is not realistic. Still, those errors do not expose limits of 
our approach. We hope to enable SCI+SCR without a full- 
parse in the future (see Section[8j). The low number of errors 
due to our context decomposition (FP6+FN6) demonstrates 
that our current approach (Section 3.3 1 is already pretty 
good. Fine-tuning the way we decompose sentences might 
decrease this number even further. 

Naturally, an evaluation should not treat entities missing 
in the ground-truth in the same way as actual errors. Ta- 
ble [5] provides quality measures for our benchmark based 
on sentences and contexts under three conditions: {origi- 
nal) evaluation based on the original TREC ground-truth; 
{-hmissing) with the entities from FP1 added to the ground 
truth; {+correct) with the errors leading to FP and FN 3,4,5 
corrected. 

The numbers for +correct show the high potential of our 
system and motivate further work correcting the respective 
errors. As argued in the discussion after Table [4] many 
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R-Prec 


MAP 


nDCG 


Sentences 


original 


0.57 


0.60 


0.44 


0.49 


+missing 


0.76 


0.75 


0.58 


0.63 




original 


0.57 


0.62 


0.44 


0.53 


Contexts 


+missing 


0.79 


0.77 


0.62 


0.69 




+correct 


0.94 


0.92 


0.83 


0.86 



Table 5: Quality measures on TREC 2009 queries 
for three different levels of corrections. 

corrections are easily applied, while some of them remain 
hard to correct perfectly. 

Already with respect to the original ground truth, R- 
prccision is 0.62. This is better than the R-prccision of 0.55 
reported in [9j Table 10] for the best run from the TREC 
2009 Entity Track when restricted to the English Wikipedia. 
When adding the missing entities to the ground truth, R- 
precision rises to 0.77. However, comparing the sentence 
and the context approach we see relatively little improve- 
ment. This is surprising, especially since we have reported a 
highly significant increase in F-measure in Table |3] The rea- 
son is simple. The measures in Tableware highly dependent 
on a good ranking. So far our system uses simplistic ranks, 
determined by mere term frequency. We plan to improve 
on that in the future; see Section [8] We want to stress the 
following though. Many semantic queries, including all from 
our TREC benchmark, have a small set of relevant results. 
We believe that for such queries the quality of the result set 
as a whole (as reflected by the figures of Table [3} is more im- 
portant than the ranking within the result set (as reflected 
by the figures of Table |EJ. 

8. CONCLUSIONS AND FUTURE WORK 

We have presented Broccoli, a search engine for the inter- 
active exploration of combined text and ontology data. We 
have described the index, the natural language processing, 
and the user interface behind Broccoli. And we have pro- 
vided evidence that Broccoli is indeed fast and gives search 
results of good quality. 

So far, we have implemented all the basic ideas we deemed 
necessary to provide a convincing proof of concept. But 
there is much room for improvement and future work: (1) 
the query processing could be optimized in a number of 
ways; (2) clever caching strategies have the potential to fur- 
ther improve query times significantly; (3) improve on the 
entity recognition and anaphora resolution; (4) a smarter 
ranking; (5) evaluate Broccoli on a larger, web-like collec- 
tion; (6) efficient high-quality SCI+SCR without the need 
for an expensive and error-prone full parse; (7) integrate sim- 
ple inference heuristics to reduce the number of FN2-type 
errors; (8) a user study of our UI and the whole system. 
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